The package ggplot2 by Hadley Wickham implements graphs based on the grammar of graphics. The concept for this plotting framework was laid out by Leland Wilkinson in his 1999 book The Grammar of Graphics.
It has two main principles:
There are seven grammatical elements in total, of which three are essential:
The non-essential ones include:
A good reference for ggplot can be found at http://ggplot2.tidyverse.org/reference/.
All plots are generated with the same function: ggplot(). However, this function only initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())
It has two main arguments:
data Default dataset to use for the plot. If not already a data.frame, it will be converted to one by fortify(). If not specified, it must be supplied in each layer added to the plot.
mapping Default list of aesthetic mappings to use for the plot. If not specified, it must be supplied in each layer added to the plot.
> library(ggplot2)
> p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
There are three common ways to invoke ggplot():
ggplot(df, aes(x, y, <other aesthetics>))
ggplot(df)
ggplot()
The first method is recommended if all layers use the same data and the same set of aesthetics, although this method can also be used to add a layer using data from another data frame. The second method specifies the default data frame to use for the plot, but no aesthetics are defined up front. This is useful when one data frame is used predominantly as layers are added, but the aesthetics may vary from one layer to another. The third method initializes a skeleton ggplot object which is fleshed out as layers are added. This method is useful when multiple data frames are used to produce different layers, as is often the case in complex graphics.
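As a minimal sketch of the three invocation styles, the following builds the same scatterplot in each way, using the mpg data that ships with ggplot2. The object names p1, p2 and p3 are illustrative only.

```r
library(ggplot2)

# 1. Data and aesthetics declared once and shared by all layers
p1 <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# 2. Default data only; the layer declares its own aesthetics
p2 <- ggplot(mpg) +
  geom_point(aes(displ, hwy))

# 3. Empty skeleton; the layer brings its own data and aesthetics
p3 <- ggplot() +
  geom_point(data = mpg, mapping = aes(displ, hwy))
```

layer_data(p1) returns the data frame computed for the first layer, which gives a simple way to confirm that the three objects describe the same points.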
In order to see something, we have to add “layers”. These are added with the + operator. Behind the scenes, each layer is created by the layer() function. You will rarely call layer() directly, but knowing it helps you understand how ggplot2 works.
p + layer(
mapping = NULL,
data = NULL,
geom = "point",
stat = "identity",
position = "identity",
params = list()
)
This call fully specifies the five components to the layer:
mapping: A set of aesthetic mappings, specified using the aes() function and combined with the plot defaults. If NULL, uses the default mapping set in ggplot().
data: A dataset which overrides the default plot dataset. It is usually omitted (set to NULL), in which case the layer will use the default data specified in ggplot().
geom: The name of the geometric object to use to draw each observation. Geoms can have additional arguments. All geoms take aesthetics as parameters. If you supply an aesthetic (e.g. colour) as a parameter, it will not be scaled, allowing you to control the appearance of the plot.
stat: The name of the statistical transformation to use. A statistical transformation performs some useful statistical summary and is key to histograms and smoothers. To keep the data as is, use the “identity” stat. You only need to set one of stat and geom: every geom has a default stat, and every stat a default geom. Most stats take additional parameters to specify the details of the statistical transformation.
position: The method used to adjust overlapping objects, like jittering, stacking or dodging.
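The position component is perhaps easiest to grasp by example. The sketch below, which assumes the mpg data from ggplot2 and uses illustrative object names, applies three common position adjustments.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = class, fill = drv))

# Stack the bars for each drive type on top of each other
p_stack <- base + geom_bar(position = "stack")

# Place the bars for each drive type side by side
p_dodge <- base + geom_bar(position = "dodge")

# Jitter spreads out points that would otherwise overlap exactly;
# height = 0 leaves the y values untouched
p_jitter <- ggplot(mpg, aes(cyl, hwy)) +
  geom_point(position = position_jitter(width = 0.2, height = 0))
```

With stacking, the top of each bar is the total count per class; with dodging, each bar keeps its own count.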
It’s useful to understand the layer() function so you have an understanding of the layer object. But you’ll rarely use the full layer() call because it’s so verbose. Instead, you’ll use the shortcut geom_ functions: geom_point(mapping, data, ...) is exactly equivalent to layer(mapping, data, geom = "point", ...).
> p + geom_point()
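To make the equivalence concrete, here is a small sketch that builds the same layer both ways and compares the positions each one computes. The names p_short and p_full are illustrative, and the params argument follows the current layer() signature.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Shortcut form
p_short <- p + geom_point()

# Explicit form: geom, stat and position are spelled out
p_full <- p + layer(
  geom = "point",
  stat = "identity",
  position = "identity",
  params = list(na.rm = FALSE)
)

# Compare the x and y positions computed for each layer
same_xy <- all.equal(
  layer_data(p_short)[c("x", "y")],
  layer_data(p_full)[c("x", "y")]
)
```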
Every layer must have some data associated with it, and that data must be in a data frame. However, the data on each layer doesn’t need to be the same, and it’s often useful to combine multiple datasets in a single plot. Let us illustrate this idea:
> mod <- loess(hwy ~ displ, data = mpg)
> grid <- data.frame(displ = seq(min(mpg$displ), max(mpg$displ), length = 50))
> grid$hwy <- predict(mod, newdata = grid)
> std_resid <- resid(mod) / mod$s
> outlier <- subset(mpg, abs(std_resid) > 2)
> ggplot(mpg, aes(displ, hwy)) +
+ geom_point() +
+ geom_line(data = grid, colour = "blue", size = 1.5) +
+ geom_text(data = outlier, aes(label = model))
The aesthetic mappings, defined with aes(), describe how variables are mapped to visual properties or aesthetics. In simpler words, an aesthetic is “something you can see”. They include position (on the x and y axes), colour, fill, shape, line type, and size.
aes() takes a sequence of aesthetic-variable pairs like this:
> aes(x = displ, y = hwy, colour = class)
Here we map x-position to displ, y-position to hwy, and colour to class. The names of the first two arguments can be omitted, in which case they correspond to the x and y variables. That makes this specification equivalent to the one above:
> aes(displ, hwy, colour = class)
Aesthetic mappings can be supplied in the initial ggplot() call, in individual layers, or in some combination of both. All of these calls create the same plot specification:
> ggplot(mpg, aes(displ, hwy, colour = class)) +
+ geom_point()
> ggplot(mpg, aes(displ, hwy)) +
+ geom_point(aes(colour = class))
> ggplot(mpg, aes(displ)) +
+ geom_point(aes(y = hwy, colour = class))
> ggplot(mpg) +
+ geom_point(aes(displ, hwy, colour = class))
The aesthetics set in the ggplot() function are the default for all other layers. You can, however, add, override, or even remove mappings (e.g. aes(y=NULL)) in the layers.
> # par(mfrow = c(1, 2)) only affects base graphics; the two ggplot2 plots below are drawn one after the other
> ggplot(mpg, aes(displ, hwy, colour = class)) +
+ geom_point() +
+ geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess'
> ggplot(mpg, aes(displ, hwy)) +
+ geom_point(aes(colour = class)) +
+ geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess'
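Overriding and removing a default mapping can be sketched in the same way. The example below assumes the mpg data and uses a linear smoother purely to keep the output free of messages; the object names are illustrative.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy, colour = class))

# Override: this layer colours points by drive type instead of class
p_override <- base + geom_point(aes(colour = drv))

# Remove: setting colour to NULL drops the inherited mapping for this layer,
# so the smoother is fitted once to all points rather than once per class
p_remove <- base +
  geom_point() +
  geom_smooth(aes(colour = NULL), method = "lm", formula = y ~ x, se = FALSE)
```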
Geometric objects are the actual marks we put on a plot. We have already seen a few examples of geoms: geom_point, geom_line, and geom_smooth. There are many more, however. You can get a list of available geometric objects with the help.search() command:
> help.search("geom_", package = "ggplot2")
An excellent overview can be found at http://sape.inf.usi.ch/quick-reference/ggplot2/geom.
There are a few rules for using geoms:
Geoms are added to a plot with the + operator.
geom_point requires mappings for x and y; all others are optional.
geom_text accepts a label mapping.
Some plot types (such as scatterplots) do not require transformations: each point is plotted at x and y coordinates equal to the original values. Other plots, such as boxplots, histograms, and prediction lines, require statistical transformations. For a boxplot, the y values must be summarised by the median, the quartiles, and whiskers that extend at most 1.5 × IQR beyond the quartiles. For a histogram, the x values must be binned and counted.
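One way to see what a statistical transformation produces is to inspect the data a layer computes. The following sketch, assuming the mpg data and illustrative object names, pulls the boxplot statistics out of a plot and compares the medians with values computed by hand.

```r
library(ggplot2)

p_box <- ggplot(mpg, aes(x = class, y = hwy)) + geom_boxplot()

# One row per box: whisker ends (ymin, ymax), quartiles (lower, upper)
# and the median (middle)
box_stats <- layer_data(p_box)[c("x", "ymin", "lower", "middle", "upper", "ymax")]

# The same medians computed directly from the raw data
by_hand <- tapply(mpg$hwy, mpg$class, median)
```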
Each geom has a default statistic, but these can be changed. For example, the default statistic for geom_histogram is stat_bin.
> args(geom_histogram)
function (mapping = NULL, data = NULL, stat = "bin", position = "stack",
..., binwidth = NULL, bins = NULL, na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE)
NULL
> args(stat_bin)
function (mapping = NULL, data = NULL, geom = "bar", position = "stack",
..., binwidth = NULL, bins = NULL, center = NULL, boundary = NULL,
breaks = NULL, closed = c("right", "left"), pad = FALSE,
na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
NULL
Arguments to stat_ functions can be passed through geom_ functions. This can be slightly annoying because, to change them, you first have to determine which stat the geom uses and then which arguments that stat takes.
> housing <- read.csv("/home/Fabian2/Desktop/StatCompMeta/Rgraphics/dataSets/landdata-states.csv")
> head(housing[1:5])
State region Date Home.Value Structure.Cost
1 AK West 2010.25 224952 160599
2 AK West 2010.50 225511 160252
3 AK West 2009.75 225820 163791
4 AK West 2010.00 224994 161787
5 AK West 2008.00 234590 155400
6 AK West 2008.25 233714 157458
>
> p2 <- ggplot(housing, aes(x = Home.Value))
> p2 + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
By changing the binwidth argument, which geom_histogram passes on to stat_bin, we obtain:
> p2 + geom_histogram(stat = "bin", binwidth=4000)
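The relationship between a geom and its stat can also be checked on a built-in dataset. The sketch below uses faithful and diamonds instead of the housing file, which is an assumption made only so that the example runs anywhere; the object names are illustrative.

```r
library(ggplot2)

p <- ggplot(faithful, aes(x = waiting))

# geom_histogram() passes binwidth on to stat_bin()
h_geom <- p + geom_histogram(binwidth = 5)

# The same layer, written from the stat side
h_stat <- p + stat_bin(binwidth = 5, geom = "bar")

# A different stat on a bar geom: pre-computed counts are drawn as they are
counts <- as.data.frame(table(cut = diamonds$cut))
b_identity <- ggplot(counts, aes(x = cut, y = Freq)) +
  geom_bar(stat = "identity")
```

Both histogram objects yield the same bin counts, which can be confirmed by comparing layer_data(h_geom)$count with layer_data(h_stat)$count.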
The coordinate system of a plot, together with the x and y position scale, determines the location of geoms. There are a few options:
coord_cartesian - (default) Cartesian coordinate system (x horizontal from left to right, y vertical from bottom to top)
coord_flip - flipped Cartesian coordinate system (x vertical from bottom to top, y horizontal from left to right)
coord_trans - transformed Cartesian coordinate system, e.g. log transform
coord_equal - forces a fixed ratio of the physical representations on the x and y axes
coord_polar - polar coordinate system; the x (or y) scale is mapped to the angle (theta)
coord_map - various map projections
> d <- data.frame(height = c(1,2,2,3,4), weight = c(1,3,4,4,2))
> p <- ggplot() +
+ geom_line(data=d, mapping=aes(x=height, y=weight)) +
+ geom_point(data=d, mapping=aes(x=height, y=weight), size=8, fill="white", shape=21) +
+ geom_text(data=d,mapping=aes(x=height, y=weight, label=seq(1,nrow(d))))
> p + coord_cartesian()
> p + coord_flip()
> p + coord_trans(x="log10", y="log10")
> p + coord_equal()
> p + coord_polar(theta="x")
> p + coord_polar(theta="y")
Aesthetic mapping (i.e., with aes()) only says that a variable should be mapped to an aesthetic. It doesn’t say how that should happen. For example, when mapping a variable to shape with aes(shape = x) you don’t say what shapes should be used. Similarly, aes(color = z) doesn’t say what colors should be used. Describing what colors/shapes/sizes etc. to use is done by modifying the corresponding scale. In ggplot2, scales include position, color and fill, size, shape, and line type.
Scales are modified with a series of functions using a scale_<aesthetic>_<type> naming scheme. To get an overview, type:
> help.search("scale_", package = "ggplot2")
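The naming scheme is easier to remember with a few concrete functions side by side. This is a minimal sketch on the mpg data; the axis title, the colours and the shape codes are arbitrary choices made for illustration.

```r
library(ggplot2)

p_scales <- ggplot(mpg, aes(displ, hwy, colour = cty, shape = drv)) +
  geom_point() +
  # aesthetic = x, type = continuous
  scale_x_continuous(name = "Engine displacement (litres)") +
  # aesthetic = colour, type = gradient between two colours
  scale_colour_gradient(low = "blue", high = "red") +
  # aesthetic = shape, type = manual: one point symbol per level of drv
  scale_shape_manual(values = c("4" = 16, "f" = 17, "r" = 15))
```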
Let us consider an example of a dot plot showing the distribution of the home price index by Date and State.
>
> p3 <- ggplot(housing,
+ aes(x = State,
+ y = Home.Price.Index)) +
+ theme(legend.position="top",
+ axis.text=element_text(size = 6))
> (p4 <- p3 + geom_point(aes(color = Date),
+ alpha = 0.5,
+ size = 1.5,
+ position = position_jitter(width = 0.25, height = 0)))
Now, let us modify the name of the x axis and the breaks and labels of the color scale.
> p4 + scale_x_discrete(name="State Abbreviation") +
+ scale_color_continuous(name="",
+ breaks = c(1976, 1994, 2013),
+ labels = c("'76", "'94", "'13"))
Next, change the low and high values to blue and red.
> p4 +
+ scale_x_discrete(name="State Abbreviation") +
+ scale_color_continuous(name="",
+ breaks = c(1976, 1994, 2013),
+ labels = c("'76", "'94", "'13"),
+ low = "blue", high = "red")
> library(scales)
> p4 +
+ scale_color_continuous(name="",
+ breaks = c(1976, 1994, 2013),
+ labels = c("'76", "'94", "'13"),
+ low = muted("blue"), high = muted("red"))
Faceting creates multiple graphs for subsets of the data. ggplot2 offers two functions for creating small multiples:
facet_wrap(): define subsets as the levels of a single grouping variable
facet_grid(): define subsets as the crossing of two grouping variables
> p5 <- ggplot(housing, aes(x = Date, y = Home.Value))
> p5 + geom_line(aes(color = State))
There are two problems here. First, there are too many states to distinguish each one by color, and secondly, the lines obscure one another.
> (p5 <- p5 + geom_line() +
+ facet_wrap(~State, ncol=10))
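Since the example above only uses facet_wrap(), here is a short sketch contrasting it with facet_grid(). It assumes the mpg data rather than the housing file, and the object names are illustrative.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# One panel per level of a single variable, wrapped into rows of four
p_wrap <- base + facet_wrap(~class, ncol = 4)

# One panel per combination of two variables: drv in rows, cyl in columns
p_grid <- base + facet_grid(drv ~ cyl)
```

facet_grid() lays out every combination of the two variables, so some panels may be empty when a combination does not occur in the data.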
The ggplot2 theme system handles non-data elements such as axis labels, the plot background, the facet label background, and the legend appearance.
All theme options are documented in ?theme. There are a number of built-in themes, e.g.
theme_gray() (default)
theme_bw()
theme_classic()
> p5 + theme_linedraw()
> p5 + theme_light()
Specific theme elements can be overridden using theme().
> p5 + theme_minimal() +
+ theme(text = element_text(color = "turquoise"))
You can create new themes, as in the following example:
> theme_new <- theme_bw() +
+ theme(plot.background = element_rect(size = 1, color = "blue", fill = "grey"),
+ text=element_text(size = 12, family = "serif", color = "ivory"),
+ axis.text.y = element_text(colour = "pink"),
+ axis.text.x = element_text(colour = "red"),
+ panel.background = element_rect(fill = "pink"),
+ strip.background = element_rect(fill = "pink"))
>
> p5 + theme_new
> vec <- data.frame("x"=rnorm(100))
> ggplot(vec, aes(x=x)) +
+ geom_histogram(aes(y=..density..)) +
+ geom_rug() +
+ stat_function(fun = dnorm, colour = "red",
+ args = list(mean = mean(vec$x), sd = sd(vec$x)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
> ggplot(data = diamonds) +
+ geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
> vec <- data.frame("x"=rnorm(100))
> slope <- diff(quantile(vec$x, c(0.25, 0.75))) / diff(qnorm(c(0.25, 0.75)))
> int <- quantile(vec$x, 0.25) - slope * qnorm(0.25)
> ggplot(vec, aes(sample=x)) +
+ stat_qq() +
+ geom_abline(aes(slope = slope, intercept = int), col="red")
> library(dplyr)
> clss <- mpg %>%
+ group_by(class) %>%
+ dplyr::summarise(n = n(), hwy = mean(hwy))
>
> clss$n.spell <- paste0("n = ", clss$n)
>
> ggplot(data = mpg, aes(y = hwy, x = class)) +
+ geom_jitter() +
+ geom_point(data = clss, aes(x = class, y = hwy), col = "red", size = 4) +
+ geom_text(data = clss, aes(label = n.spell, y = 10 ))
> ggplot(data = diamonds) +
+ geom_bar(mapping = aes(x = cut))
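Compared to base graphics, ggplot2 takes a different approach, and some kinds of graphics fall outside its scope:
3-dimensional graphics (see the rgl package)
graph-theory type graphs (see the igraph package)
interactive graphics (see ggvis)
A good way to compare the two philosophies is to create the same plots with both approaches. Let us have a look at a few excellent examples taken from the blog “Tufte in R” by Lukasz Piwek.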
> x <- 1967:1977
> y <- c(0.5,1.8,4.6,5.3,5.3,5.7,5.4,5,5.5,6,5)
> plot(y ~ x, axes=F, xlab="", ylab="", pch=16, type="b")
> axis(1, at=x, label=x, tick=F, family="serif")
> axis(2, at=seq(1,6,1), label=sprintf("$%s", seq(300,400,20)), tick=F, las=2, family="serif")
> abline(h=6,lty=2)
> abline(h=5,lty=2)
> text(max(x), min(y)*2.5,"Per capita\nbudget expenditures\nin constant dollars", adj=1,
+ family="serif")
> text(max(x), max(y)/1.08, labels="5%", family="serif")
> library(ggplot2)
> library(ggthemes)
> x <- 1967:1977
> y <- c(0.5,1.8,4.6,5.3,5.3,5.7,5.4,5,5.5,6,5)
> d <- data.frame(x, y)
> ggplot(d, aes(x,y)) + geom_line() + geom_point(size=3) + theme_tufte(base_size = 15) +
+ theme(axis.title=element_blank()) + geom_hline(yintercept = c(5,6), lty=2) +
+ scale_y_continuous(breaks=seq(1, 6, 1), label=sprintf("$%s",seq(300,400,20))) +
+ scale_x_continuous(breaks=x,label=x) +
+ annotate("text", x = c(1977,1977.2), y = c(1.5,5.5), adj=1, family="serif",
+ label = c("Per capita\nbudget expenditures\nin constant dollars", "5%"))
> library(devtools)
> source_url("https://raw.githubusercontent.com/sjmurdoch/fancyaxis/master/fancyaxis.R")
SHA-1 hash of file is 1eaf278aced5a1206f1c0bffab087d26f44e3f73
> x <- faithful$waiting
> y <- faithful$eruptions
> plot(x, y, main="", axes=FALSE, pch=16, cex=0.8,
+ xlab="Time till next eruption (min)", ylab="Duration (sec)",
+ xlim=c(min(x)/1.1, max(x)), ylim=c(min(y)/1.5, max(y)))
> axis(1, tick=F)
> axis(2, tick=F, las=2)
> axisstripchart(faithful$waiting, 1)
> axisstripchart(faithful$eruptions, 2)
> library(ggplot2)
> library(ggExtra)
> library(ggthemes)
> p <- ggplot(faithful, aes(waiting, eruptions)) + geom_point() + theme_tufte(ticks=F)
> ggMarginal(p, type = "histogram", fill="transparent")
> library(devtools)
> #install_github("leeper/slopegraph", force = TRUE)#install Leeper's package from Github
> library(slopegraph)
> data(cancer)
> slopegraph(cancer, col.lines = 'gray', col.lab = 1, col.num = 1,
+ xlim = c(-.2,5),
+ main = "Estimate of % survival rates",
+ xlabels = c('5 Year','10 Year','15 Year','20 Year'))
> library(ggplot2)
> library(ggthemes)
> library(devtools)
> library(RCurl)
> library(plyr)
> source_url("https://raw.githubusercontent.com/jkeirstead/r-slopegraph/master/slopegraph.r")
SHA-1 hash of file is 2b0f7bf51a5e18d2ea638a502ef449b6e654d27a
> d <- read.csv(text = getURL("https://raw.githubusercontent.com/jkeirstead/r-slopegraph/master/cancer_survival_rates.csv"))
> df <- build_slopegraph(d, x="year", y="value", group="group", method="tufte", min.space=0.04)
> df <- transform(df, x=factor(x, levels=c(5,10,15,20),
+ labels=c("5 years","10 years","15 years","20 years")), y=round(y))
> plot_slopegraph(df) + labs(title="Estimates of % survival rates") +
+ theme_tufte(base_size=16, ticks=F) + theme(axis.title=element_blank())
Loading required package: grid
Warning: `axis.ticks.margin` is deprecated. Please set `margin` property of
`axis.text` instead
Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
instead